In [2]:
source("https://bioconductor.org/biocLite.R")
In [3]:
biocLite("MLSeq")
In [5]:
citation("MLSeq")
Out[5]:
In [4]:
library(MLSeq)
In [6]:
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
filepath
Out[6]:
In [7]:
cervical = read.table(filepath, header = TRUE)
In [8]:
head(cervical[, 1:5])
Out[8]:
In [9]:
class(cervical)
Out[9]:
In [10]:
dim(cervical)
Out[10]:
The first 29 columns of the data contain the miRNA mapped counts of the non-tumor samples, while the last 29 columns contain the counts for the tumor samples. We need to create a class label in order to apply classification models.
In [11]:
class = data.frame(condition = factor(rep(c("N", "T"), c(29, 29))))
as.factor(class[, 1])
Out[11]:
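The rep(c("N", "T"), c(29, 29)) call above expands to 29 "N" labels followed by 29 "T" labels, one per sample. A quick base-R check (independent of the cervical data itself):

```r
labels <- rep(c("N", "T"), c(29, 29))

table(labels)     # 29 N and 29 T
length(labels)    # 58, one label per column of the count matrix
```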
For simplicity, we can work with a subset of cervical data using the first 150 features.
In [12]:
data = cervical[c(1:150), ]
Now we can split the data into training and test sets. The training set can be used to build classification models, and the test set can be used to assess the performance of each model. We use the set.seed function to specify the initial value of the random-number seed, and the sample function to select the test samples.
In [13]:
nTest = ceiling(ncol(data) * 0.2)
set.seed(12345)
ind = sample(ncol(data), nTest, FALSE)
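As a quick sanity check (plain base R, no MLSeq needed), resetting the seed before sample reproduces exactly the same test-set indices, which is what makes the split reproducible. The 58 below stands in for ncol(data):

```r
# With 58 samples, 20% rounds up to 12 test samples.
nTest <- ceiling(58 * 0.2)           # 12

set.seed(12345)
ind1 <- sample(58, nTest, replace = FALSE)
set.seed(12345)
ind2 <- sample(58, nTest, replace = FALSE)

identical(ind1, ind2)                # TRUE: same seed, same split
```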
Now, training and test sets can be created based on this sampling process.
In [18]:
data.train = data[, -ind]
data.train = as.matrix(data.train + 1)
classtr = data.frame(condition = class[-ind, ])
data.test = data[, ind]
data.test = as.matrix(data.test + 1)
classts = data.frame(condition = class[ind, ])
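The negative index -ind used above selects every column except the sampled test columns, so the training and test sets partition the samples with no overlap. A minimal base-R illustration on a toy matrix:

```r
m   <- matrix(1:20, nrow = 2)        # toy matrix with 10 columns
ind <- c(2, 5, 9)                    # pretend these are the test columns

train <- m[, -ind]                   # drop the test columns
test  <- m[, ind]                    # keep only the test columns

ncol(train) + ncol(test) == ncol(m)  # TRUE: the two sets partition the columns
```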
Now we have 46 samples, which will be used to train the classification models, and 12 remaining samples to test model performance.
In [19]:
dim(data.train)
Out[19]:
In [20]:
dim(data.test)
Out[20]:
In [21]:
data.trainS4 = DESeqDataSetFromMatrix(countData = data.train, colData = classtr, formula(~condition))
data.trainS4 = DESeq(data.trainS4, fitType = "local")
data.trainS4
Out[21]:
In [22]:
data.testS4 = DESeqDataSetFromMatrix(countData = data.test, colData = classts, formula(~condition))
data.testS4 = DESeq(data.testS4, fitType = "local")
data.testS4
Out[22]:
In [23]:
svm = classify(data = data.trainS4, method = "svm", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "T")
svm
The classify call above depends on the kernlab and e1071 packages; if they are not installed, the call fails, so we install them from a CRAN mirror and rerun the model fit.
In [25]:
install.packages("kernlab", repos='http://archive.linux.duke.edu/cran/')
In [27]:
install.packages("e1071", repos='http://archive.linux.duke.edu/cran/')
In [28]:
svm = classify(data = data.trainS4, method = "svm", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "T")
svm
Out[28]:
After running the code given above, we obtain the results as an MLSeq object. SVM successfully fits a model with 97.8% classification accuracy, misclassifying only one non-tumor sample.
The "svm" object also stores information about model training and the parameters used to build the model.
In [29]:
getSlots("MLSeq")
Out[29]:
In [30]:
trained(svm)
Out[30]:
Now let's build a random forest model.
In [32]:
rf <- classify(data = data.trainS4, method = "randomforest", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt= 3, ref = "T")
rf
Out[32]:
In [33]:
pred.svm = predictClassify(svm, data.testS4)
pred.svm
Out[33]:
In [34]:
pred.rf = predictClassify(rf, data.testS4)
pred.rf
Out[34]:
In [35]:
table(pred.svm, relevel(data.testS4$condition, 2))
table(pred.rf, relevel(data.testS4$condition, 2))
Out[35]:
Out[35]:
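For any confusion table like the ones above, overall accuracy is the sum of the diagonal (correct predictions) divided by the table total. A small base-R sketch with made-up labels (illustrative only, not the actual test results):

```r
truth <- factor(c("T", "T", "N", "N", "T", "N"), levels = c("T", "N"))
pred  <- factor(c("T", "T", "N", "T", "T", "N"), levels = c("T", "N"))

tab <- table(pred, truth)            # rows: predictions, columns: truth
acc <- sum(diag(tab)) / sum(tab)     # correct predictions / all predictions
acc                                  # 5 of 6 correct
```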
In [36]:
sessionInfo()
Out[36]:
In [37]:
confusionMatrix(table(pred.svm, relevel(data.testS4$condition, 2)))
Out[37]: